Reminder: the code for this article is available here.
Natural language processing is, broadly speaking, about getting computers to understand human language. This article, however, only deals with text that has already been saved as plain text files; it will not touch on handwriting recognition, speech recognition, machine translation, and the like. For text that is just sitting in a file, it may be hard to see what use there is in having a computer "read" it, but there are a few possible directions this can go.
You may have noticed that this article is specifically about natural language processing for English. The reason for treating it separately is that English and Chinese NLP face very different fundamental difficulties. For English, the harder part is stemming, that is, turning "started" into "start" or "eats" into "eat", though existing packages already perform quite well here. For Chinese, the main difficulty is tokenization (word segmentation): Chinese vocabulary is highly variable and, unlike English, words are not separated by spaces, so segmentation has to rely on algorithms. As for packages, the open-source community most often uses jieba; Academia Sinica reportedly has built its own toolkit, but from the feedback I have heard several times from teachers and speakers in this field, most…
import pandas as pd
import nltk
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
from nltk.corpus import stopwords
stops = stopwords.words('english')
from string import punctuation
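One note before moving on: `stopwords.words('english')` above, as well as the tokenizers, the WordNet lemmatizer, and `pos_tag` used later, all rely on NLTK data packages. If they have not been fetched yet, those calls raise a LookupError; a one-time download (resource names as of recent NLTK versions) takes care of it:

nltk.download('punkt')                       ## tokenizer models used by word_tokenize
nltk.download('stopwords')                   ## the English stopword list
nltk.download('wordnet')                     ## WordNet data used by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  ## the default model behind nltk.pos_tag
nltk.download('universal_tagset')            ## the mapping used by tagset='universal'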
Tokenization means splitting a sentence into individual words (tokens). Below are two of the tokenization functions that nltk provides.
testStr = "This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None."
tokens = nltk.word_tokenize(testStr)
print(tokens)
tokens = nltk.wordpunct_tokenize(testStr) ## note the difference at "cut-off": wordpunct_tokenize splits it into "cut", "-", "off"
print(tokens)

Stemming and lemmatization both map the different tenses and inflected forms of a word back to a single form. Stemming is closer to chopping off suffixes such as -ed or -s appended to the end of a word, while lemmatization reduces a word to its lemma, i.e. its dictionary root form. Let's look at a demonstration below.
df = pd.DataFrame(index = tokens)
df['porter_stemmer'] = [porter_stemmer.stem(t) for t in tokens]
df['lancaster_stemmer'] = [lancaster_stemmer.stem(t) for t in tokens]
df['snowball_stemmer'] = [snowball_stemmer.stem(t) for t in tokens]
df['wordnet_lemmatizer'] = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
df
| token | porter_stemmer | lancaster_stemmer | snowball_stemmer | wordnet_lemmatizer |
|---|---|---|---|---|
| This | Thi | thi | this | This | 
| value | valu | valu | valu | value | 
| is | is | is | is | is | 
| also | also | also | also | also | 
| called | call | cal | call | called | 
| cut | cut | cut | cut | cut | 
| - | - | - | - | - | 
| off | off | off | off | off | 
| in | in | in | in | in | 
| the | the | the | the | the | 
| literature | literatur | lit | literatur | literature | 
| . | . | . | . | . | 
| If | If | if | if | If | 
| float | float | flo | float | float | 
| the | the | the | the | the | 
| parameter | paramet | paramet | paramet | parameter | 
| represents | repres | repres | repres | represents | 
| a | a | a | a | a | 
| proportion | proport | proport | proport | proportion | 
| of | of | of | of | of | 
| documents | document | docu | document | document | 
| integer | integ | integ | integ | integer | 
| absolute | absolut | absolv | absolut | absolute | 
| counts | count | count | count | count | 
| . | . | . | . | . | 
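A few things stand out in this table: the stemmers simply chop suffixes and often produce strings that are not real words ("valu", "literatur", "paramet"), with LancasterStemmer being the most aggressive ("lit", "flo"), while the WordNet lemmatizer keeps dictionary words but leaves "called" and "represents" untouched because, without a part-of-speech hint, it treats every token as a noun.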
In practice, preprocessing is not just tokenization plus stemming or lemmatization: we also convert the English text to lowercase and, depending on the length of the sentences, decide whether to drop stopwords and punctuation (the pieces are put together in a small sketch after the tables below). nltk's English stopword list (the `stops` variable above) begins like this:
|   |   |   |   |   |   |   |   |   |   |
|---|---|---|---|---|---|---|---|---|---|
| i | me | my | myself | we | our | ours | ourselves | you | your |
| yours | yourself | yourselves | he | him | his | himself | she | her | hers |
| herself | it | its | itself | they | them | their | theirs | themselves | what |
| which | who | whom | this | that | these | those | am | is | are |
| was | were | be | been | being | have | has | had | having | do |
| does | did | doing | a | an | the | and | but | if | or |
| because | as | until | while | of | at | by | for | with | about |
| against | between | into | through | during | before | after | above | below | to |
| from | up | down | in | out | on | off | over | under | again |
| further | then | once | here | there | when | where | why | how | all |
## note: the stopword check here is case-sensitive, so capitalized tokens such as "This" and "If" survive the filter
df = pd.DataFrame(index = [t for t in tokens if t not in stops])
df['porter_stemmer'] = [porter_stemmer.stem(t.lower()) for t in tokens if t not in stops]
df['lancaster_stemmer'] = [lancaster_stemmer.stem(t.lower()) for t in tokens if t not in stops]
df['snowball_stemmer'] = [snowball_stemmer.stem(t.lower()) for t in tokens if t not in stops]
df['wordnet_lemmatizer'] = [wordnet_lemmatizer.lemmatize(t.lower()) for t in tokens if t not in stops]
df
| token | porter_stemmer | lancaster_stemmer | snowball_stemmer | wordnet_lemmatizer |
|---|---|---|---|---|
| This | thi | thi | this | this | 
| value | valu | valu | valu | value | 
| also | also | also | also | also | 
| called | call | cal | call | called | 
| cut | cut | cut | cut | cut | 
| literature | literatur | lit | literatur | literature | 
| If | if | if | if | if | 
| float | float | flo | float | float | 
| parameter | paramet | paramet | paramet | parameter | 
| represents | repres | repres | repres | represents | 
| proportion | proport | proport | proport | proportion | 
| documents | document | docu | document | document | 
| integer | integ | integ | integ | integer | 
| absolute | absolut | absolv | absolut | absolute | 
| counts | count | count | count | count | 
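The code above drops stopwords but leaves punctuation tokens alone, even though `punctuation` was imported from the `string` module at the top. A minimal sketch of a combined pipeline (lowercase first, then drop stopwords and single-character punctuation tokens, then lemmatize) might look like this; the helper name `preprocess` is just for illustration:

def preprocess(text):
    ## tokenize and lowercase, then drop stopwords and single-character punctuation tokens
    tokens = [t.lower() for t in nltk.wordpunct_tokenize(text)]
    tokens = [t for t in tokens if t not in stops and t not in punctuation]
    ## lemmatize whatever is left
    return [wordnet_lemmatizer.lemmatize(t) for t in tokens]

preprocess(testStr)

Finally, nltk can also label every token with its part of speech. pos_tag supports both the default Penn Treebank tagset and a simplified universal tagset: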
df_tag = pd.DataFrame(index = tokens)
df_tag['default'] = [tag for token, tag in nltk.pos_tag(tokens)]                       ## keep only the Penn Treebank tags
df_tag['universal'] = [tag for token, tag in nltk.pos_tag(tokens, tagset='universal')] ## keep only the universal tags
df_tag
| token | default | universal |
|---|---|---|
| This | DT | DET | 
| value | NN | NOUN | 
| is | VBZ | VERB | 
| also | RB | ADV | 
| called | VBN | VERB | 
| cut | VBN | VERB | 
| - | : | . | 
| off | RB | ADV | 
| in | IN | ADP | 
| the | DT | DET | 
| literature | NN | NOUN | 
| . | . | . | 
| If | IN | ADP | 
| float | NN | NOUN | 
| ",","," | . | |
| the | DT | DET | 
| parameter | NN | NOUN | 
| represents | VBZ | VERB | 
| a | DT | DET | 
| proportion | NN | NOUN | 
| of | IN | ADP | 
| documents | NNS | NOUN | 
| integer | NN | NOUN | 
| absolute | NN | NOUN | 
| counts | NNS | NOUN | 
| . | . | . |
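As noted earlier, the WordNet lemmatizer treats every word as a noun unless it is told otherwise, which is why "called" and "represents" came through unchanged. The POS tags above can supply that hint. A minimal sketch, assuming the tag_map dictionary defined here (any tag outside the map falls back to noun):

from nltk.corpus import wordnet

## map universal POS tags onto the pos argument expected by WordNetLemmatizer
tag_map = {'VERB': wordnet.VERB, 'NOUN': wordnet.NOUN, 'ADJ': wordnet.ADJ, 'ADV': wordnet.ADV}
df_tag['lemma_with_pos'] = [wordnet_lemmatizer.lemmatize(t.lower(), pos=tag_map.get(tag, wordnet.NOUN))
                            for t, tag in nltk.pos_tag(tokens, tagset='universal')]
df_tag

With the verb hint in place, "called" and "represents" should now come out as "call" and "represent".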